Combining pulse-based features for rejecting far-field speech in a HMM-based Voice Activity Detector

Authors

  • Óscar Varela
  • Rubén San-Segundo-Hernández
  • Luís A. Hernández
Abstract

Several computational techniques for speech recognition have been proposed in recent years. These techniques represent an important improvement for real-time applications in which a speaker interacts with a speech recognition system. However, none of the proposed methods solves the high false-alarm problem that arises when far-field speakers interfere in a human-machine conversation. This paper presents a two-class (speech and non-speech) decision-tree based approach that combines new speech-pulse features in a VAD (Voice Activity Detector) for rejecting far-field speech in speech recognition systems. The decision tree is applied to the speech pulses obtained by a baseline VAD composed of a frame feature extractor, an HMM-based (Hidden Markov Model) segmentation module and a pulse detector. The paper also presents a detailed analysis of a large set of features for discriminating between close-talk and far-field speech. The detection error obtained with the proposed VAD is the lowest among the well-known VADs evaluated.

Key Words: Voice Activity Detector, Decision Tree, Hidden Markov Model, Cepstrum, Auto-correlation and Linear Prediction Coefficients (LPC).

1. Introduction

The advantages of using Automatic Speech Recognition are obvious for several types of applications. Speech recognition becomes difficult when the main speaker is in a noisy environment, for example in a bar, where many far-field speakers are talking almost all the time. This factor reduces the speech recognizer success rate and can lead to an unsatisfactory experience for the user: if there are too many recognition mistakes, the user is forced to correct the system, which takes too long and is a nuisance, and the user will eventually reject the system. To address this problem, a robust Voice Activity Detector is proposed in this work. The VAD selects speech frames and discards noise frames; this frame information is sent to the speech recognizer so that only speech pronunciations are processed, and the VAD thus tries to avoid recognizer mistakes caused by noisy frames. If the VAD works well, so does the speech recognizer.

In summary, it is very common in mobile phone scenarios to find situations in which the target speaker is in an open environment surrounded by far-field interfering speech from other speakers. In this ambiguous case, VAD systems can detect far-field speech as coming from the user, increasing the speech recognition error rate. Detection errors caused by background voices mainly increase word insertions and substitutions, leading to significant dialogue misunderstandings. This work addresses speech-based applications in which far-field speech can be wrongly considered main-speaker speech.

In [1], a spectrum sensing scheme is proposed to detect the presence of the primary user in cognitive radio systems (a task very similar to the VAD proposed in this paper), being able to distinguish between main-speaker speech and far-field speech. Moreover, the system implemented in [1] uses one-order feature detection and compares its results with an energy detector, showing a relevant improvement. In our work a comparative study is also carried out, comparing our proposal with other well-known VADs: AURORA (FD), AMR1, AMR2 and G.729 Annex B. Another recent work is [2], where the authors use the pitch lag as a feature to achieve better speech quality in the AMR codec.
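As a rough illustration of how a pitch lag and the corresponding maximum auto-correlation value can be estimated from a signal frame, the following Python sketch may help; it is a minimal example under assumed frame length, sampling rate and pitch range, not the exact feature definition used in [2] or in this paper.

```python
import numpy as np

def pitch_autocorr_features(frame, fs=8000, f0_min=60.0, f0_max=400.0):
    """Estimate the pitch lag of a frame and the normalized maximum
    auto-correlation value at that lag (illustrative sketch only)."""
    frame = frame - np.mean(frame)                # remove DC offset
    ac = np.correlate(frame, frame, mode="full")  # full auto-correlation
    ac = ac[len(frame) - 1:]                      # keep non-negative lags
    if ac[0] <= 0.0:                              # silent / empty frame
        return 0, 0.0
    ac = ac / ac[0]                               # normalize so ac[0] == 1
    lag_min = int(fs / f0_max)                    # shortest plausible pitch period
    lag_max = min(int(fs / f0_min), len(ac) - 1)  # longest plausible pitch period
    search = ac[lag_min:lag_max + 1]
    pitch_lag = lag_min + int(np.argmax(search))  # lag of the auto-correlation peak
    max_autocorr = float(search.max())            # maximum auto-correlation value
    return pitch_lag, max_autocorr

# Example: a synthetic voiced 30 ms frame (120 Hz tone plus noise) at 8 kHz
fs = 8000
t = np.arange(int(0.03 * fs)) / fs
frame = np.sin(2 * np.pi * 120 * t) + 0.1 * np.random.randn(t.size)
lag, r_max = pitch_autocorr_features(frame, fs)
print(lag, r_max)  # lag near fs/120 ≈ 67 samples; r_max close to 1 for voiced speech
```

Voiced close-talk speech tends to give a high maximum auto-correlation value, which is why this quantity is attractive as a discriminative feature.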
We also use the pitch indirectly to improve voice activity detection: the maximum auto-correlation value obtained when computing the pitch is taken as a feature. In the same way, in [3] the authors apply a threshold selection algorithm to different speech-signal features to improve a speech-based system. There are also several learning schemes: in [4] the authors train a neural network in order to obtain the best system response, and [5] presents a new way of computing the weights for combining multiple neural-network classifiers based on Particle Swarm Optimization (PSO). In this paper, a decision tree is trained for rejecting far-field speech.

Considering VAD systems in real-time applications, new VAD techniques continue to be proposed. See for example the work of Ramirez [6] on robust VAD using the Kullback-Leibler divergence measure. In [7] an SVM (Support Vector Machine) classification technique for VAD is presented. This SVM uses only MFCCs (Mel-Frequency Cepstral Coefficients) as features, and its segmentation and training method is based on HMM models, similar to the baseline VAD of this work. During detection, the incoming signals are classified into three distinct and consecutive states representing the pre-silence, speech and post-silence segments respectively. However, although experimental results are usually given for the AURORA database, to our knowledge there are no similar results for speech in the presence of far-field voices.

In several previous works, measurements similar to those considered here have been used for dereverberation techniques. In [8], for example, the authors restore speech degraded by room acoustics using stereo (two-microphone) measurements; to do this, cepstral operations are carried out when the observations have non-vanishing spectra. Another dereverberation technique, presented in [9], uses the pitch as the primary analysis feature: the method first estimates the pitch and harmonic structure of the speech signal to obtain a dereverberation operator, which is then used to enhance the signal by means of an inverse filtering operation. Single-channel blind dereverberation was proposed in [10], based on auto-correlation functions of frame-wise time sequences for different frequency components. A technique for reducing room reverberation using complex cepstral deconvolution and the behaviour of room impulse responses was presented in [11]. Reverberation reduction using least-squares inverse filtering has also been used to recover clean speech from reverberant speech. Yegnanarayana shows in [12] a method to extract the time delay between two speech signals collected at two microphone locations; the time delay is estimated from short-time spectral information (magnitude, phase or both), exploiting the different behaviour of speech spectral features under noise and reverberation degradations. Finally, Cournapeau presents in [13] a VAD based on high-order statistics to discriminate close and far-field speech, enhanced by the auto-correlation of the LPC residual. Although those authors use the auto-correlation and the LPC residual, they do not use these two features as a technique for far-field voice exclusion, as proposed in this paper.

Other works focus on applications in which background voices are involved. For example, in [14] Thilo presents a post-processing technique for recording the main speaker in a meeting.
In that case the feature vector contains the loudness values of 20 critical bands up to 8 kHz, the energy, the total loudness, the zero-crossing rate and the difference between the channel-specific energy and the mean of the far-field microphone energies. The problem with these kinds of works is that the proposed techniques require several microphones, unlike our case of study in which only one channel, i.e. one microphone, is available (in the context of telephone applications).

This paper proposes a new approach, combining specific pulse-based measurements in a decision-tree method, to improve VAD systems in the presence of background speech (coming from one or several background speakers). These measurements are easy and cost-effective to integrate into state-of-the-art VADs. The decision-tree processing of the new measurements has been incorporated into the speech pulse detection module of an HMM-based VAD.

The paper is organized as follows: the baseline VAD is described in Section 2; Section 3 presents the speech database and the feature analysis; Section 4 describes the decision tree for combining the studied features; and Section 5 presents global detection results comparing the new approach with other well-known VADs over three real mobile-telephone databases. Finally, the main conclusions are presented in Section 6.

2. Baseline Voice Activity Detector Structure

The baseline VAD is composed of three main modules (Fig. 1): the first one is the feature vector extraction, the second is the HMM-based algorithm, and the third is the pulse detector implemented as a finite state machine.

Fig. 1. Voice Activity Detector block diagram (feature vector extraction → HMM-based algorithm with a 4-state speech HMM and a 3-state noise HMM → state machine producing the speech pulse structure).

2.1. Feature Vector Extraction

The feature vector v(n) is composed of five features, as shown in Fig. 2.
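To make the pulse-level classification step concrete, the sketch below shows how a two-class decision tree could be trained over pulse-based measurements and used to accept or reject each pulse produced by the baseline VAD. The feature names, training values and tree depth are hypothetical assumptions for illustration only; the actual five frame features and the pulse features analysed in the paper are defined in the original text.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical pulse-level features (one row per detected speech pulse):
# [max auto-correlation, LPC-residual measure, cepstral measure, pulse energy (dB)]
X_train = np.array([
    [0.85, 0.70, 1.2, -12.0],   # close-talk speech pulse
    [0.80, 0.65, 1.1, -15.0],   # close-talk speech pulse
    [0.40, 0.30, 0.4, -30.0],   # far-field speech pulse
    [0.35, 0.25, 0.3, -33.0],   # far-field speech pulse
])
y_train = np.array([1, 1, 0, 0])  # 1 = accept pulse (speech), 0 = reject (non-speech)

# Two-class decision tree applied over the pulse-based measurements
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

# A new pulse delivered by the baseline VAD is then accepted or rejected
new_pulse = np.array([[0.78, 0.60, 1.0, -14.0]])
print("accept" if tree.predict(new_pulse)[0] == 1 else "reject")
```

In practice the tree would be trained on labelled pulses from the speech databases described in Section 3; the point of the sketch is only the structure of the decision step, not its parameters.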

Similar articles

A New Algorithm for Voice Activity Detection Based on Wavelet Packets (RESEARCH NOTE)

Speech constitutes much of the communicated information; most other perceived audio signals do not carry nearly as much information. Indeed, much of the non-speech signal may be classified as 'noise' in human communication. The process of separating conversational speech and noise is termed voice activity detection (VAD). This paper describes a new approach to VAD which is based on the Wavelet ...


Maximum likelihood endpoint detection with time-domain features

In this paper we propose an effective, robust and computationally low-cost HMM-based start-endpoint detector for speech recognisers. Our first attempts follow the classical feature extractor plus Viterbi classifier scheme (used for voice activity detection), followed by a post-processing stage, but the ultimate goal we pursue is a pure HMM-based architecture capable of performing the endpointing tas...


Speech enhancement based on hidden Markov model using sparse code shrinkage

This paper presents a new hidden Markov model-based (HMM-based) speech enhancement framework based on independent component analysis (ICA). We propose analytical procedures for training clean speech and noise models by the Baum re-estimation algorithm and present a maximum a posteriori (MAP) estimator based on a Laplace-Gaussian combination (for clean speech and noise respectively) in the HMM ...


Speech detection for text-dependent speaker verification

The performance of text-dependent speaker verification systems degrades in noisy environments and when the true speaker utters words that are not part of the verification password. Energy-based voice activity detection (VAD) algorithms cannot distinguish between the true speaker's speech and other background speech, or between the speaker's verification password and other words uttered by the spe...


Detection and localization of Farsi/Arabic texts in video images

Video text detection plays an important role in applications such as semantic-based video analysis, text information retrieval, archiving and so on. In this paper, we propose a Farsi/Arabic text detection approach. First, edges are extracted with an appropriate edge detector, and then artificial corners are extracted by using edge crossing points. Artificial corner histogram analysis is done for ...



Journal:
  • Computers & Electrical Engineering

Volume 37, Issue

Pages  -

Publication date: 2011